-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PCRE2: optimize memory allocations #15395
base: master
Are you sure you want to change the base?
PCRE2: optimize memory allocations #15395
Conversation
Damn, that was promising, and it's working well on Linux, but both Darwin and Windows completely broke 😭 Windows could be explained by using |
WTH: I can't reproduce the errors in my Windows VM (MSVC) 😕 |
I can reproduce when running the compiler specs. Looking at the memory graph, the committed memory keeps growing rapidly, and it starts failing exactly when reaching 32GB (the VM has 16GB of RAM). That can't be a coincidence. When running the compiler specs from master, the committed memory is rather stable or grows slowly (around 3.5GB). Let's note that a linker command still fails, but I get a bunch of weird spec errors in the VM. Maybe |
Some targets don't support the ThreadLocal annotation (for example OpenBSD, Android)
7ceca47
to
7487cb5
Compare
Here's a new attempt. From CI runs in my fork, the darwin and win32-msvc targets got fixed, but win32-gnu is still broken on dual heredocs such as Note: I believe FLS should work on Windows, which would avoid the custom destructors. The stdlib specs pass, save for these failures in the compiler specs. |
Yeah that's the LLVM 20 failure, ignore it for the moment |
I tried again, and FLS is apparently working on Windows. Nice. |
src/crystal/system/thread.cr
Outdated
Thread.threads.delete(self) | ||
Fiber.inactive(fiber) | ||
detach { system_close } | ||
end | ||
end | ||
|
||
protected def run_destructors : Nil | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo: cleanup before merge (not needed anymore).
# def initialize | ||
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
# end | ||
|
||
# def initialize(&destructor : Proc(T, Nil)) | ||
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
# end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps these could be uncommented and then called via super
in descendant classes to avoid duplicating the code?
# def initialize | |
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
# end | |
# def initialize(&destructor : Proc(T, Nil)) | |
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
# end | |
def initialize | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
end | |
def initialize(&destructor : Proc(T, Nil)) | |
initialize | |
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It avoids some repetition, but I'm not sure it would improve readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we could use Box
for this? Seems to be exactly the use case. Technically we don't need support for boxing structs onto the heap though. So this draws in additional complexity which we might want to avoid here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, you lost me @straight-shoota 😕
Do you mean to Oh, to Box
the destructor proc?Box
the value in case it's not a Pointer/Reference! 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine to not support non pointer values. If we ever need to, we'll figure it out then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea for using Box
would be that it might simplify the code for Thread::Local
. All explicit casts would become Box(T).box
and Box(T).unbox
and no need for checking T
type in the macro.
I suppose it doesn't simplify that much though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But Box
always allocates into the GC heap, and we'd put the pointer to the box in the local storage that we'd have to dereference on every get
😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Errata: Box doesn't allocate into the GC heap when T is Nil or Reference but we want to store pointers and that will allocate.
Maybe it's an overlook? 🤔
def initialize | ||
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
err = LibC.pthread_key_create(out @key, nil) | ||
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0 | ||
end | ||
|
||
def initialize(&destructor : Proc(T, Nil)) | ||
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
err = LibC.pthread_key_create(out @key, destructor.unsafe_as(Proc(Void*, Nil))) | ||
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Extract the common code into a helper to remove duplication.
def initialize | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
err = LibC.pthread_key_create(out @key, nil) | |
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0 | |
end | |
def initialize(&destructor : Proc(T, Nil)) | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
err = LibC.pthread_key_create(out @key, destructor.unsafe_as(Proc(Void*, Nil))) | |
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0 | |
end | |
def initialize | |
create_key | |
end | |
def initialize(&destructor : Proc(T, Nil)) | |
create_key destructor | |
end | |
private def create_key(destructor = nil) | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
err = LibC.pthread_key_create(out @key, destructor.unsafe_as(Proc(Void*, Nil))) | |
raise RuntimeError.from_os_error("pthread_key_create", Errno.new(err)) unless err == 0 | |
end |
ditto for win32
def initialize | ||
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
@key = LibC.FlsAlloc | ||
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES | ||
end | ||
|
||
def initialize(&destructor : Proc(T, Nil)) | ||
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
@key = LibC.FlsAlloc(destructor.unsafe_as(LibC::FLS_CALLBACK_FUNCTION)) | ||
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def initialize | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
@key = LibC.FlsAlloc | |
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES | |
end | |
def initialize(&destructor : Proc(T, Nil)) | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
@key = LibC.FlsAlloc(destructor.unsafe_as(LibC::FLS_CALLBACK_FUNCTION)) | |
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES | |
end | |
def initialize | |
create_key nil | |
end | |
def initialize(&destructor : Proc(T, Nil)) | |
create_key destructor | |
end | |
private def create_key(destructor) | |
{% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | |
@key = LibC.FlsAlloc(destructor.unsafe_as(LibC::FLS_CALLBACK_FUNCTION)) | |
raise RuntimeError.from_winerror("FlsAlloc: out of indexes") if @key == LibC::FLS_OUT_OF_INDEXES | |
end |
# def initialize | ||
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
# end | ||
|
||
# def initialize(&destructor : Proc(T, Nil)) | ||
# {% raise "T must be a Reference or Pointer" unless T < Reference || T < Pointer %} | ||
# end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea for using Box
would be that it might simplify the code for Thread::Local
. All explicit casts would become Box(T).box
and Box(T).unbox
and no need for checking T
type in the macro.
I suppose it doesn't simplify that much though.
Moving back to draft. I'll be pushing individual PRs with the changes proposed here:
A branch with all the changes is available at https://github.com/ysbaddaden/crystal/tree/feature/add-thread-local-struct |
We noticed in #15088 that we don't need
Crystal::ThreadLocalValue
in the Regex PCRE2 engine.We can reuse the JIT stack but also the match data for every Regex (no need for a specific match data per Regex). We can allocate one and make sure it's not used by other threads... hence the thread locals: no more spinlock (thread contention) nor hash.
It's simpler and faster. Here's the benchmark from #13144 for example:
Enabling MT also no longer has any impact on performance:
The drawback is that we must allocate each matchdata with a maximum number of ovectors (65535). That might increase memory usage, though I failed to notice it in practice. Maybe not allocating memory for every regular expression is helping?
Note: this PR will be separated into a couple PRs to introduce
Thread::Local(T)
separately. The point of this new type is for this patch, so I want approval on the overall approach before the split.I could have used
@[ThreadLocal]
but some targets don't support it (namely: Android, MinGW and OpenBSD) and we can't register destructors either (on thread shutdown). But usingpthread_key_create
orFlsAlloc
we can 😍Note 2: using a type and a system destructor allows to avoid adding a property to
Thread
that would then need a finalizer that would hardlink and force require PCRE2 into all Crystal applications again. Using system destructors also means that the destructor will always run, even if the thread is cancelled or whatever.Note 3: I'm wondering if this
Thread::Local(T)
type ain't deprecating the@[ThreadLocal]
annotation: it works everywhere and the annotation doesn't even spare a function call —the codegen marks it as noinline to avoid cache issues.